Text Splitter & Chunker for RAG / LLMs
Pricing
from $5.00 / 1,000 text chunks
Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.
Developer
Rosario Vitale (maintained by Community)
Last modified
4 days ago
Split any text into clean, overlapping chunks that are ready for embeddings, vector databases, RAG pipelines and LLM context windows — without writing your own splitter.
Paste text (or send many documents), pick a chunk size and overlap, and get back tidy chunks with character counts and approximate token counts as JSON or CSV.
Why
Every RAG / LLM pipeline needs chunking, and everyone re-implements the same fiddly logic: respect paragraph and sentence boundaries, keep an overlap so context isn't lost, normalize messy whitespace, and estimate tokens. This Actor does it for you, reliably, in one call.
Features
- ✂️ Smart chunking — packs text up to your target size while respecting paragraph/sentence boundaries.
- 🔁 Overlap — keeps a configurable overlap so ideas spanning a boundary aren't lost.
- 🔢 Characters or tokens — size and overlap in characters or approximate tokens (~4 chars/token).
- 🧹 Cleaning — normalizes whitespace and collapses excessive blank lines.
- 📦 Batch — split many documents in a single run.
- 📊 Token estimate — every chunk includes `charCount` and `approxTokens`.
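The packing and overlap behaviour described in the features above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual implementation: the function names `clean_text` and `chunk_paragraphs` are hypothetical, and the handling of oversized paragraphs (kept whole) and of the overlap (the tail of the previous chunk is prepended, so a chunk may slightly exceed the target) are assumptions.

```python
import re


def clean_text(text: str) -> str:
    """Normalize whitespace and collapse excessive blank lines (assumed behaviour)."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()


def chunk_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly chunk_size characters.

    Each new chunk starts with the last `overlap` characters of the previous
    chunk. A paragraph longer than chunk_size is kept whole (assumption).
    """
    paragraphs = [p for p in clean_text(text).split("\n\n") if p]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            # Close the current chunk and seed the next one with its tail.
            chunks.append(current)
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\n\n\nSecond paragraph.\n\nThird paragraph."
    for i, chunk in enumerate(chunk_paragraphs(sample, chunk_size=40, overlap=8)):
        print(i, len(chunk), repr(chunk))
```

Splitting on paragraph breaks first and cutting only when the next paragraph would overflow is what keeps chunk boundaries on natural breaks; a sentence-level variant would use the same loop with a sentence splitter in place of the paragraph split.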
Input
| Field | Type | Description |
|---|---|---|
| `text` | string | A single document to split. |
| `texts` | array | Multiple documents (one per item). |
| `chunkSize` | integer | Target chunk size. Default `1000`. |
| `chunkOverlap` | integer | Overlap between chunks. Default `100`. |
| `unit` | select | `characters` or `tokens`. Default `characters`. |
| `splitBy` | select | `paragraph`, `sentence` or `character`. Default `paragraph`. |
| `clean` | boolean | Normalize whitespace. Default `true`. |
Example input
{
  "text": "Your long document text goes here...",
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "unit": "characters",
  "splitBy": "paragraph",
  "clean": true
}
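As a rough sketch, the same input could be assembled and sent from Python with the official `apify-client` package. The helper names `build_run_input` and `run_chunker` are hypothetical, the Actor ID is a placeholder you would replace with the real one from the Store page, and the validation rules (for example, that an unknown option is rejected) are assumptions rather than documented behaviour of this Actor.

```python
from typing import Any, Optional

# Defaults taken from the input table above.
DEFAULTS: dict[str, Any] = {
    "chunkSize": 1000,
    "chunkOverlap": 100,
    "unit": "characters",
    "splitBy": "paragraph",
    "clean": True,
}
UNITS = {"characters", "tokens"}
SPLIT_MODES = {"paragraph", "sentence", "character"}


def build_run_input(text: Optional[str] = None,
                    texts: Optional[list[str]] = None,
                    **options: Any) -> dict[str, Any]:
    """Merge caller options over the documented defaults and sanity-check them."""
    if text is None and not texts:
        raise ValueError("provide `text` or a non-empty `texts` list")
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    run_input = {**DEFAULTS, **options}
    if run_input["unit"] not in UNITS:
        raise ValueError(f"unit must be one of {sorted(UNITS)}")
    if run_input["splitBy"] not in SPLIT_MODES:
        raise ValueError(f"splitBy must be one of {sorted(SPLIT_MODES)}")
    if text is not None:
        run_input["text"] = text
    if texts:
        run_input["texts"] = list(texts)
    return run_input


def run_chunker(actor_id: str, token: str, run_input: dict[str, Any]) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    payload = build_run_input(text="Your long document text goes here...")
    print(payload)
    # chunks = run_chunker("<username>/<actor-name>", "<APIFY_TOKEN>", payload)
```

The network call is left commented out because it needs a real Actor ID and API token; `build_run_input` can be exercised on its own.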
Output
One dataset item per chunk:
{
  "sourceIndex": 0,
  "chunkIndex": 0,
  "totalChunks": 3,
  "text": "Retrieval-Augmented Generation (RAG) combines a language model ...",
  "charCount": 312,
  "approxTokens": 78
}
Export as JSON, CSV, or Excel, or pull via the Apify API — then send the chunks straight to your embeddings model or vector DB.
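Once the items are downloaded, a small post-processing step is usually needed before embedding. The sketch below groups chunks by `sourceIndex`, orders them by `chunkIndex`, and builds simple records for a vector store. The function names and the record shape (`id`, `text`, `metadata`) are hypothetical and not tied to any particular database client.

```python
from collections import defaultdict
from typing import Any


def group_chunks(items: list[dict[str, Any]]) -> dict[int, list[dict[str, Any]]]:
    """Group dataset items by source document, ordered by chunk position."""
    by_source: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for item in items:
        by_source[item["sourceIndex"]].append(item)
    for chunks in by_source.values():
        chunks.sort(key=lambda c: c["chunkIndex"])
    return dict(by_source)


def to_records(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build one record per chunk with a stable ID (hypothetical shape)."""
    return [
        {
            "id": f"{item['sourceIndex']}-{item['chunkIndex']}",
            "text": item["text"],
            "metadata": {
                "sourceIndex": item["sourceIndex"],
                "chunkIndex": item["chunkIndex"],
                "approxTokens": item["approxTokens"],
            },
        }
        for item in items
    ]


if __name__ == "__main__":
    items = [
        {"sourceIndex": 0, "chunkIndex": 1, "totalChunks": 2,
         "text": "second", "charCount": 6, "approxTokens": 2},
        {"sourceIndex": 0, "chunkIndex": 0, "totalChunks": 2,
         "text": "first", "charCount": 5, "approxTokens": 2},
    ]
    print([c["text"] for c in group_chunks(items)[0]])
    print(to_records(items)[0]["id"])
```

Deriving the ID from `sourceIndex` and `chunkIndex` keeps re-runs over the same documents idempotent when the records are upserted.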
Common use cases
- Prepare documents for embeddings + vector search (Pinecone, Qdrant, Weaviate, pgvector).
- Build RAG context for ChatGPT/Claude apps.
- Fit long content into LLM context windows.
- Chunk text extracted from PDFs, for example by running the PDF to Structured Data Actor first and passing its output here.
Notes
- Token counts are an estimate (~4 characters per token); exact tokenization depends on the model.
- For `character` split mode, the text is hard-cut at the size boundary; `paragraph` and `sentence` respect natural boundaries.
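The two notes above can be illustrated with a short sketch. It assumes the estimate is simply the character count divided by 4 and rounded up (the rounding rule is a guess; it does reproduce the 312 characters to 78 tokens shown in the output example), and that a size given in tokens is converted to characters by the same factor. The function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic from the notes above; real tokenizers differ


def approx_tokens(text: str) -> int:
    """Estimate the token count from the character count (rounding up is assumed)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def size_in_chars(size: int, unit: str) -> int:
    """Convert a size given in 'tokens' to characters; pass 'characters' through."""
    return size * CHARS_PER_TOKEN if unit == "tokens" else size


def split_by_character(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Hard-cut text every chunk_size characters, repeating `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(approx_tokens("x" * 312))
    print(size_in_chars(250, "tokens"))
    print(split_by_character("abcdefghij", chunk_size=4, overlap=1))
```

Because the cut positions ignore word and sentence boundaries, this mode is mainly useful when strict size limits matter more than readable chunk edges.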